PART ONE

DOMAIN: Telecom

CONTEXT: A telecom company wants to use their historical customer data to predict behaviour to retain customers. You can analyse all relevant customer data and develop focused customer retention programs.

DATA DESCRIPTION: Each row represents a customer and each column contains a customer attribute, as described in the column metadata. The data set includes information about:

• Customers who left within the last month – the column is called Churn

• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

• Demographic info about customers – gender, age range, and if they have partners and dependents

PROJECT OBJECTIVE: Build a model that will help to identify the potential customers who have a higher probability to churn. This will help the company understand the pain points and patterns of customer churn and sharpen its focus on strategising customer retention.

ANSWER:

Objectives:

Our objective here is to identify potential churners, i.e. customers with a higher probability of churn, via different models and analyses.
Telecom companies often use customer attrition analysis and customer attrition rates as one of their key business metrics because the cost of retaining an existing customer is far less than acquiring a new one. Companies from these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.

Companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs due to a decision by the customer to switch to another company or service provider, involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or the relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided.

We need to derive patterns from the given data to predict churn, which in turn determines the potential customers to focus on.

1. Import and warehouse data:

• Import all the given datasets. Explore shape and size.

• Merge all datasets onto one and explore final shape and size.

Importing Data and Exploring

df1 has 7043 rows and 10 columns

df2 has 7043 rows and 11 columns

ASSUMPTION:

Since the two files share no common column (not even a customer ID), we assume that row i in data set 1 corresponds to row i in data set 2.

After merging we have the same number of rows (7043) but 21 columns.
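A minimal sketch of the import-and-merge step. The file names are hypothetical placeholders, and the tiny stand-in frames just illustrate the index-based concatenation used under the assumption above:

```python
import pandas as pd

# Hypothetical file names -- substitute the actual dataset paths:
# df1 = pd.read_csv("telecom_part1.csv")   # 7043 rows x 10 columns
# df2 = pd.read_csv("telecom_part2.csv")   # 7043 rows x 11 columns

# Tiny stand-in frames so the example is self-contained:
df1 = pd.DataFrame({"gender": ["Male", "Female"], "tenure": [1, 34]})
df2 = pd.DataFrame({"MonthlyCharges": [29.85, 56.95], "Churn": ["No", "Yes"]})

# With no shared key column, align the files row-by-row on the index.
df = pd.concat([df1.reset_index(drop=True),
                df2.reset_index(drop=True)], axis=1)
print(df.shape)  # (2, 4) here; (7043, 21) on the real data
```

Note that this row-alignment assumption is fragile; if either file were re-sorted, the merge would silently mis-pair customers.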

2. Data cleansing:

• Missing value treatment

• Convert categorical attributes to continuous using relevant functional knowledge

• Drop attribute/s if required using relevant functional knowledge

• Automate all the above steps

There are 11 missing values for TotalCharges. Let's remove these 11 rows from our data set.

• Convert categorical attributes to continuous using relevant functional knowledge

• Automate all the above steps
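The cleansing steps can be automated in one function. This is a sketch under two assumptions: the blanks in TotalCharges are the 11 missing values, and simple integer category codes are an acceptable encoding of the categoricals:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Automates the cleansing steps: fix TotalCharges, drop missing
    rows, and encode categoricals as numeric codes."""
    df = df.copy()
    # TotalCharges is read as text because of blank entries; coerce to
    # numeric so the blanks become NaN, then drop those rows.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df = df.dropna(subset=["TotalCharges"])
    # Map every remaining object column to integer category codes.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes
    return df

# Tiny illustrative frame (the blank TotalCharges mimics the bad rows):
raw = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"],
                    "Churn": ["No", "Yes", "No"]})
print(clean(raw))
```

On the real merged frame this would drop the 11 bad rows, leaving 7032 rows of fully numeric data.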

3. Data analysis & visualisation:

• Perform detailed statistical analysis on the data.

• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

1. For tenure, the mean and median differ very little, so the distribution is close to normal.

2. For MonthlyCharges, the mean is slightly less than the median.

3. For TotalCharges, the data is not scaled and the mean and median differ. Since mean > median, positive skewness exists.
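The mean-versus-median check behind comments 1-3 is a one-liner; the series below is an illustrative stand-in, not the real TotalCharges column:

```python
import pandas as pd

# Illustrative positively skewed values (stand-in for TotalCharges).
s = pd.Series([10, 20, 30, 40, 500])

print(s.mean(), s.median(), s.skew())
# mean (120.0) > median (30.0) and skew > 0: positive skewness,
# the same pattern described for TotalCharges above.
```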

Finding the unique values in each column

gender : Gender is almost equally distributed, so we have data for both categories equally.

SeniorCitizen : We have more data for non-senior citizens.

Partner : Partner is almost equally distributed.

Dependents : The data has more customers in the non-dependent category.

PhoneService : There are more customers with phone service.

MultipleLines : Customers without multiple lines are the largest group.

InternetService : Customers with fiber optic internet service are the largest group.

OnlineSecurity : Customers with no online security are the largest group.

OnlineBackup : Customers with no online backup are the largest group.

DeviceProtection : Customers with no device protection are the largest group.

TechSupport : Customers with no tech support are the largest group.

StreamingTV : Customers without streaming TV are the largest group.

StreamingMovies : Customers without streaming movies are the largest group.

Contract : Customers on month-to-month contracts are the largest group.

PaperlessBilling : Customers with paperless billing are the largest group.

PaymentMethod : Customers paying by electronic check are the largest group.

We can infer that many of the existing customers do not use the optional services: for many of the columns, the "not using the service" category is the largest.

We can now explore the data visually with various statistical analyses.

Univariate Analysis

There is one property in the dataset that contains discrete values: the churn class. The chart types we can use for a single discrete value distribution are a countplot (seaborn's bar chart of category counts) and a percentage distribution.

We can see that the churn data is not evenly split: 73.4% are non-churn records and 26.6% are churn records.

Here we have less churn data, which is a problem, as churn is exactly what we are trying to predict.

This is important to keep in mind for our modelling, as this imbalance could lead to a lot of false negatives. We will see in the modelling section how to handle the imbalance in the data.
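The class split quoted above can be reproduced with value_counts; the Series below is a stand-in built to match the reported 73.4 / 26.6 split:

```python
import pandas as pd

# Stand-in Churn column matching the reported 73.4% / 26.6% split.
churn = pd.Series(["No"] * 734 + ["Yes"] * 266)

# Percentage distribution of the single discrete target:
pct = churn.value_counts(normalize=True) * 100
print(pct.round(1))  # No 73.4, Yes 26.6
# seaborn.countplot(x=churn) would draw the matching bar chart.
```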

Here the data is equally distributed between male and female.

Here we have more young people than senior citizens; 16.2% of the data is for senior citizens.

Here we can see that the data is almost equally distributed.

Here we have more data on customers without dependents (70.2%).

The data shows that a lot of people prefer phone service, as 90.3% have opted for it.

Here we can see that 42.2% of customers have multiple lines, 48.1% do not, and 9.7% have no phone service.

44% of customers have fiber optic internet service, 34.4% have DSL and 21.6% have no internet service.

Here we can see that 28.7% of customers have online security, 49.7% have no online security and 21.6% have no internet service.

Here we can see that 34.5% of customers have online backup, 43.9% have no online backup and 21.6% have no internet service.

Here we can see that 34.4% of customers have device protection, 44% have no device protection and 21.6% have no internet service.

Here we can see that 29% of customers have tech support, 49.4% have no tech support and 21.6% have no internet service.

Here we can see that 38.4% of customers have streaming TV, 39.9% have no streaming TV and 21.6% have no internet service.

Here we can see that 38.8% of customers have streaming movies, 39.5% have no streaming movies and 21.6% have no internet service.

20.9% of customers have a one-year contract, 24% have a two-year contract and 55.1% are on month-to-month contracts.

59.3% of people have paperless billing.

Customers paying by electronic check are the largest group, while bank transfer, mailed check and credit card have an almost equal distribution.

Continuous Data univariate analysis

After looking at the histogram below, we can see that a lot of customers have been with the telecom company for just a month, while quite a few have stayed for about 72 months. This could be because different customers have different contracts; depending on the contract they are on, it may be easier or harder for them to stay with or leave the telecom company.

Here the data has two humps, so the distribution is not normal. We don't see any outliers here.

The distribution is not normal. We don't see any outliers here.

The distribution shows strong positive skewness. We don't see any outliers here.

Bi Variate Analysis

New customers are more likely to churn. We have a few outliers for churn = Yes, so we can say that a few long-tenured customers are also leaving.

Customers with higher monthly charges are also more likely to churn.

Customers with low total charges are churning more. We have a few outliers for the churn = Yes category.

Numerical Data Pairplot

There is clear linearity between tenure and TotalCharges. This is expected rather than coincidental, since total charges accumulate over a customer's tenure.

There is also some correlation between TotalCharges and MonthlyCharges.

As we can see in the heatmap, there is some collinearity between tenure and TotalCharges, Contract and tenure, and MonthlyCharges and TotalCharges.
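The collinearity read off the heatmap comes from a plain correlation matrix. The toy frame below builds TotalCharges as tenure x MonthlyCharges to mimic the structural link in the real data:

```python
import pandas as pd

# Toy numeric columns; TotalCharges accumulates with tenure, which is
# what produces the collinearity seen in the heatmap.
df = pd.DataFrame({"tenure": [1, 12, 24, 48, 72],
                   "MonthlyCharges": [30.0, 55.0, 60.0, 80.0, 95.0]})
df["TotalCharges"] = df["tenure"] * df["MonthlyCharges"]

corr = df.corr()
print(corr.round(2))
# seaborn.heatmap(corr, annot=True) would reproduce the plot above.
```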

Multivariate Analysis

Here we can see the distribution of TotalCharges for different combinations of categorical values, with churn as the criterion.

This helps in understanding the distribution among different categorical values.

We can see noise from the "No internet service" category in many columns, which is not good; that category will be treated as noise by many models. This is noted in the improvement section.

Since there is not much correlation between the different columns, we can skip hypothesis testing.

4. Data pre-processing:

• Segregate predictors vs target attributes

• Check for target balancing and fix it if found imbalanced.

• Perform train-test split.

• Check if the train and test data have similar statistical characteristics when compared with original data.

• Segregate predictors vs target attributes

1. x: features

2. y: target variable (Churn)

Checking outliers

We have 73.4% churn = No data and only 26.6% churn = Yes data, which makes the target highly imbalanced.

This is important to keep in mind for our modelling, as the imbalance could lead to a lot of false negatives.

• Perform train-test split.
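A sketch of the pre-processing steps above: segregating x and y, upsampling the minority class to fix the imbalance (sklearn's resample is one simple option), and a stratified train-test split. The frame here is synthetic, not the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Illustrative imbalanced frame (roughly the 73/27 churn split).
df = pd.DataFrame({"tenure": range(100),
                   "Churn": [0] * 73 + [1] * 27})

# Fix the imbalance by upsampling the minority class to parity.
major = df[df["Churn"] == 0]
minor = df[df["Churn"] == 1]
minor_up = resample(minor, replace=True, n_samples=len(major),
                    random_state=42)
balanced = pd.concat([major, minor_up])

# Segregate predictors (x) from the target (y) and split 70/30,
# stratifying so both partitions keep the same class mix.
x = balanced.drop(columns="Churn")
y = balanced["Churn"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42)
print(y_train.mean(), y_test.mean())  # both close to 0.5
```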

5. Model training, testing and tuning:

• Train and test all ensemble models taught in the learning module.

• Suggestion: Use standard ensembles available. Also you can design your own ensemble technique using weak classifiers. 

• Display the classification accuracies for train and test data.

• Apply all the possible tuning techniques to train the best model for the given data.

• Suggestion: Use all possible hyper parameter combinations to extract the best accuracies. 

• Display and compare all the models designed with their train and test accuracies.

• Select the final best trained model along with your detailed comments for selecting this model.

• Pickle the selected model for future use.

Decision Tree Analysis

The train data accuracy is very high but the test data accuracy is low; this is due to overfitting on the training data. We can restrict the depth of the tree to avoid overfitting.

We can see that precision and recall for churn = 1 are very poor here (0.47 and 0.50). We are interested in predicting churn = Yes, hence the model is not good at all.

The weighted recall, f1 and precision are better due to the high values for churn = 0, while the macro average points out that the low overall scores are caused by the poor values for churn = 1.

Hence, we need to use sampled data.

Decision Tree with sampled data Analysis

Here we can see that there is an increase in test data accuracy, which is good. Next we can restrict the depth of the tree to avoid overfitting.

Here we can see an increase in recall, precision and f1 score for churn = 1 after sampling. We are interested in predicting churn = Yes, hence this model is better than the one trained on unsampled data.

We can see balanced macro and weighted averages here. These values are better than those for the unsampled data.

Next we will try to remove the overfitting of training data by restricting the depth of the tree

NOTE: from here onward we will use the sampled data only, as the unsampled data gives biased errors and low recall for churn = 1, which is bad because churn = 1 is exactly what we are trying to predict in order to prevent customers from leaving.

Decision Tree with sampled data and restricted depth Analysis

Here we can see that there is an increase in test data accuracy, which is good.

Here we can see an increase in recall, precision and f1 score. The scores are balanced between churn = 1 and churn = 0, which means the model is equally good at predicting both values.

We can see balanced macro and weighted averages here.

Here we can see that columns like MonthlyCharges, TotalCharges and Contract have very high importance scores.
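The importance scores mentioned above come from the fitted tree's feature_importances_ attribute. This sketch uses a toy frame whose column names merely echo the report, not the real data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the cleansed churn frame.
df = pd.DataFrame({
    "MonthlyCharges": [20, 90, 85, 25, 95, 30, 88, 22],
    "Contract":       [2, 0, 0, 2, 0, 1, 0, 2],  # encoded contract type
    "Churn":          [0, 1, 1, 0, 1, 0, 1, 0],
})

x, y = df.drop(columns="Churn"), df["Churn"]
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(x, y)

# Importance scores per column, highest first.
imp = pd.Series(tree.feature_importances_, index=x.columns)
print(imp.sort_values(ascending=False))
```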

Bagging Classifier Analysis

Here we can see that the test data accuracy increases to 83%, which is good.

Here we can see an increase in recall, precision and f1 score. The scores are balanced between churn = 1 and churn = 0, which means the model is equally good at predicting both values.

We can see balanced macro and weighted averages here.

All the scores are better than those of the models analysed so far.

AdaBoostClassifier Analysis

Here we can see a slight decrease in test data accuracy, to 80.34%.

Here we can see a decrease in recall, precision and f1 score. The scores remain balanced between churn = 1 and churn = 0, which means the model is equally good at predicting both values.

We can see balanced macro and weighted averages here.

GradientBoostingClassifier Analysis

Here we can see that the test data accuracy of 82.24% is better than the AdaBoostClassifier's but less than the BaggingClassifier's.

The scores are balanced between churn = 1 and churn = 0, which means the model is equally good at predicting both values.

We can see balanced macro and weighted averages here.

RandomForestClassifier Analysis

Here we can see that the test data accuracy increases to 84.05%, which is good.

Here we can see an increase in recall, precision and f1 score. The scores are balanced between churn = 1 and churn = 0, which means the model is equally good at predicting both values.

We can see balanced macro and weighted averages here.

All the scores are better than those of the models analysed so far.

Design your own ensemble technique using weak classifiers.

Above we have already analysed the homogeneous ensemble models.

We will now use stacking and voting with a few of the classifiers above: KNN, RandomForestClassifier and GradientBoostingClassifier.
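A sketch of the heterogeneous ensembles described above, assuming synthetic data in place of the prepared churn matrices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              VotingClassifier, StackingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced churn data.
x, y = make_classification(n_samples=300, n_features=8, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=42)

# The same base learners feed both ensembles.
base = [("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42))]

voter = VotingClassifier(estimators=base, voting="hard").fit(x_tr, y_tr)
stacker = StackingClassifier(
    estimators=base, final_estimator=LogisticRegression()).fit(x_tr, y_tr)
print(voter.score(x_te, y_te), stacker.score(x_te, y_te))
```

Voting takes a majority of the base predictions, while stacking learns a meta-model (here a logistic regression) over them.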

VotingClassifier Analysis

Here we can see a very slight increase in test data accuracy, to 84.09%, which is good.

Here we can see little change in recall, precision and f1 score. The scores are balanced between churn = 1 and churn = 0, which means the model is equally good at predicting both values.

We can see balanced macro and weighted averages here.

StackingClassifier Analysis

Here we can see a slight increase in test data accuracy, to 84.51%, which is good.

Here we can see little change in recall, precision and f1 score. The scores are balanced between churn = 1 and churn = 0, which means the model is equally good at predicting both values.

We can see balanced macro and weighted averages here.

HYPERTUNING THE MODEL

Here we can see that after tuning, the best accuracy is around 83.51%.

NOTE: We use a smaller hyperparameter grid because the execution time grows quickly with the number of parameter combinations.
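A minimal GridSearchCV sketch matching the note above: the grid is kept deliberately small, since every added parameter value multiplies the number of fits. The data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the balanced churn data.
x, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Deliberately small grid -- a fuller grid multiplies run time quickly.
grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      grid, cv=3).fit(x, y)
print(search.best_params_, round(search.best_score_, 4))
```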

CROSS-VALIDATION ACCUMULATED RESULT

CROSS-VALIDATION ACCUMULATED ANALYSIS

Here we can see that RandomForestClassifier, the GridSearchCV-tuned model, VotingClassifier and StackingClassifier have almost the same IQR for the cross-validation results.

RandomForestClassifier has the highest mean accuracy, although it also has a wide IQR; that IQR is similar to those of the other top-performing models.

Among RandomForestClassifier, the GridSearchCV-tuned model, VotingClassifier and StackingClassifier, which have the highest accuracies, RandomForestClassifier has the best results in terms of the 25th-75th percentile range.

We can say that RandomForestClassifier has the most consistent accuracy and is comparable in accuracy to the other best-performing models.

Model Comparison and Analysis

RandomForestClassifier, VotingClassifier and StackingClassifier have the highest accuracy.

Among the models with high accuracy, RandomForestClassifier, VotingClassifier and StackingClassifier have the best error metrics.

The cross-validation mean and standard deviation are good for RandomForestClassifier, VotingClassifier and StackingClassifier.

Precision, recall and f1 score are best for RandomForestClassifier, VotingClassifier and StackingClassifier.

Since RandomForestClassifier has a shorter execution time and is more stable compared to VotingClassifier and StackingClassifier, we select RandomForestClassifier for future use.
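Pickling the selected model, as the brief requires. The file name here is a hypothetical choice, and a small synthetic model stands in for the real fitted forest:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the selected, fully trained model.
x, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = RandomForestClassifier(random_state=42).fit(x, y)

# Persist the selected model, then reload it for future scoring.
with open("churn_rf.pkl", "wb") as fh:
    pickle.dump(model, fh)
with open("churn_rf.pkl", "rb") as fh:
    restored = pickle.load(fh)

# The restored model reproduces the original predictions exactly.
assert (restored.predict(x) == model.predict(x)).all()
```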

Conclusion and improvements:

• Write your conclusion on the results.

• Detailed suggestions or improvements or on quality, quantity, variety, velocity, veracity etc. on the data points collected by the telecom operator to perform a better data analysis in future.

Write your conclusion on the results.

Reference:

Precision: when the model predicts a positive result, how often is it correct? i.e. it limits the number of false positives.

Recall: when the result is actually positive, how often does the model predict it correctly? i.e. it limits the number of false negatives.

f1-score: Harmonic mean of precision and recall.
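These definitions can be sanity-checked on a tiny hand-computable example:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]  # TP=2, FP=1, FN=1

p = precision_score(y_true, y_pred)  # TP/(TP+FP) = 2/3
r = recall_score(y_true, y_pred)     # TP/(TP+FN) = 2/3
f = f1_score(y_true, y_pred)         # harmonic mean of p and r = 2/3
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```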

Model                       Accuracy  R Squared  MSE       MSLE      CV mean   CV std    Precision  Recall    f1
DecisionTreeClassifier      0.718483  -0.452264  0.530581  0.367771  0.785323  0.069430  0.467005   0.497297  0.481675
DecisionTreeClassifier      0.773079   0.092081  0.476362  0.330189  0.785323  0.069430  0.761299   0.784777  0.772859
DecisionTreeClassifier      0.791156   0.164405  0.456995  0.316765  0.787251  0.044139  0.777006   0.807087  0.791761
BaggingClassifier           0.829890   0.319384  0.412444  0.285884  0.838105  0.074956  0.832555   0.818898  0.825670
AdaBoostClassifier          0.803422   0.213481  0.443372  0.307322  0.796448  0.034783  0.781885   0.832677  0.806482
GradientBoostingClassifier  0.822466   0.289679  0.421348  0.292056  0.818436  0.042157  0.803995   0.845144  0.824056
RandomForestClassifier      0.840542   0.362003  0.399322  0.276789  0.839073  0.072814  0.843333   0.830052  0.836640
knn                         0.759199   0.036547  0.490714  0.340137  0.780170  0.014424  0.729363   0.811680  0.768323
VotingClassifier            0.840865   0.363294  0.398917  0.276508  0.841195  0.048046  0.826472   0.856299  0.841121
StackingClassifier          0.845061   0.380084  0.393622  0.272838  0.845846  0.055367  0.845695   0.837927  0.841793

• Detailed suggestions or improvements or on quality, quantity, variety, velocity, veracity etc. on the data points collected by the telecom operator to perform a better data analysis in future.

SUGGESTIONS

The two files provided have no common ID (file 2 has no customer ID at all), so we had to assume the rows in both files are in the same order. A shared customer ID should be collected in future.

We have more data for churn = No and less for churn = Yes. Since we are trying to predict churn = Yes, we should collect more churn data for better analysis and results.

There are a few attributes with high correlation, for example the highly correlated continuous attributes MonthlyCharges and TotalCharges.

The data is highly imbalanced for Churn, SeniorCitizen, Dependents and PhoneService.

Many categorical columns have a third category, "No internet service", which is often misleading and treated as noise. This leads to poor modelling and prediction.

===========================================================================================

END